Abstract: Text categorization is an important and well-studied area of pattern recognition with a variety of modern applications. Effective spam email filtering, automated document organization and management, and improved information retrieval systems all benefit from techniques in this field. The problem of feature selection, i.e., choosing the most relevant features from what can be an extremely large set of data, is particularly important for accurate text categorization. The proposed system (i) uses the well-known Porter and Lancaster stemmers to pre-process the training dataset; (ii) explores a number of feature selection metrics for text categorization, among which information gain (IG), chi-square (CHI), mutual information (MI), Ng-Goh-Low (NGL), Galavotti-Sebastiani-Simi (GSS), relevancy score (RS), multi-sets of features (MSF), document frequency (DF), and odds ratio (OR) are considered most effective; pruning techniques are also proposed that discard features based on term frequency (TF) and document frequency (DF) to further reduce the set of candidate features (typically words) in a document before a feature selection method is applied; and (iii) finally classifies documents over the selected features with two algorithms, KNN and Naive Bayes. Two benchmark collections were chosen as testbeds: Reuters-21578 and a small portion of Reuters Corpus Version 1 (RCV1). Results are consistent across the two classifiers and both data collections, and a further increase in performance is obtained by combining uncorrelated and high-performing feature selection methods.
Keywords: Text Categorization, Feature Selection, Pre-processing, Pruning, KNN, Naive Bayes
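The pruning-then-selection step described above can be illustrated with a minimal sketch: terms are first pruned by a document frequency (DF) threshold, and the survivors are ranked by the chi-square (CHI) statistic, one of the metrics listed in the abstract. The function names, parameters (`min_df`, `top_k`), and the toy corpus below are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter

def chi_square(A, B, C, D):
    """Chi-square statistic for a term/class 2x2 contingency table.
    A: docs in class containing term, B: docs outside class containing term,
    C: docs in class without term,   D: docs outside class without term."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0
    return N * (A * D - C * B) ** 2 / denom

def select_features(docs, labels, min_df=2, top_k=5):
    """DF-prune rare terms, then rank survivors by max chi-square over classes."""
    # Document frequency pruning: drop terms appearing in < min_df documents.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vocab = [t for t, c in df.items() if c >= min_df]
    N = len(docs)
    scores = {}
    for t in vocab:
        best = 0.0
        for c in set(labels):
            A = sum(1 for d, l in zip(docs, labels) if l == c and t in d)
            B = df[t] - A
            C = sum(1 for l in labels if l == c) - A
            D = N - A - B - C
            best = max(best, chi_square(A, B, C, D))
        scores[t] = best
    # Keep the top_k highest-scoring terms as the selected feature set.
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

The selected terms would then feed a classifier such as KNN or Naive Bayes; the same scaffold accommodates the other metrics (IG, MI, OR, ...) by swapping the scoring function.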